This report explores a white wine quality data set containing 4,898 white wines, with 11 variables quantifying the chemical properties of each wine. At least three wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). The main question is:
**Which chemical properties influence the quality of white wines?**
This data set will be analyzed to answer this question.
First, let’s take a general look at the data set:
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
All of the variables are numeric. The X variable is an ID for each wine, and quality is an integer. Quality is the response variable: we want to know which of the other variables can be used to predict it.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
Here is a quick look at the range of each variable, along with some summary statistics. For example, the minimum quality is 3 and the maximum quality is 9.
From this plot, the distribution of quality is roughly normal. Rating 6 has the most wines, followed by rating 5, while ratings 9 and 3 have the fewest, which suggests that it is hard to make an excellent wine. Further exploratory data analysis (EDA) is needed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
This table supports the previous plot: five wines are rated 9 and 20 wines are rated 3. The summary shows that at least 75% of the wines are rated 6 or lower, and the mean and median are very close, which is consistent with a roughly symmetric distribution.
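As a quick check of the 75% statement, the quality column can be rebuilt from the counts in the table above. This is only a sketch: the variable names are mine, and only the counts come from the report.

```r
# Counts per quality score, copied from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Rebuild the quality column from the counts.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Cumulative share of wines rated at or below each score.
cum_share <- cumsum(table(quality)) / length(quality)
print(round(cum_share, 3))

# Share of wines rated 6 or lower.
share_le_6 <- mean(quality <= 6)
print(share_le_6)
```

The share for rating 6 is 3838 / 4898, which is a little above the 75% implied by the third quartile.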
It is useful to create a new variable called quality_rate that labels ratings 3, 4, and 5 as Bad, rating 6 as Good, and ratings 7, 8, and 9 as Excellent.
##
## Excellent Good Bad
## 1060 2198 1640
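One way to build such a variable is with cut(). The sketch below is an assumption about how it could be done, not necessarily the code used for this report; the quality vector is rebuilt from the rating counts shown earlier.

```r
# Quality scores rebuilt from the rating counts reported earlier.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Intervals are (2,5], (5,6], (6,9] because cut() closes on the right by default.
quality_rate <- cut(quality,
                    breaks = c(2, 5, 6, 9),
                    labels = c("Bad", "Good", "Excellent"))

# Reorder the levels to match the table above.
quality_rate <- factor(quality_rate, levels = c("Excellent", "Good", "Bad"))

print(table(quality_rate))
```

With these breaks the counts come out as 1060 Excellent, 2198 Good, and 1640 Bad, matching the table.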
I will rescale the x axis to look more closely at the distribution in the histogram. The box plots show many outliers in every quality rate, so I will also limit the y axis to see the median of each rate.
Fixed acidity has peaks at 6.8, 6.6, and 6.4. I set the bin width very small to show the count for every value. The table of this variable shows that the values have one decimal place (x.x), except for three values that are visible in the plot: 6.15, 6.45, and 7.15. The distribution is roughly normal, with the bulk of the data between 5.8 and 7.8.
The box plots show no large difference in fixed acidity between the three rates on average. The statistical summary supports this:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.700 6.725 7.200 9.200
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.838 7.300 14.200
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.400 6.800 6.962 7.500 11.800
The medians are very close to each other (6.7, 6.8, and 6.8 for the Excellent, Good, and Bad rates). Because of the outliers, the median is a better measure of center than the mean. Note that the maximum for the Excellent rate is 9.2 g/dm^3, the lowest maximum of the three rates.
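The per-rate summaries above have the layout that base R's by() prints. A minimal sketch of that kind of call follows; the nine fixed acidity values are made up for illustration and are not rows from the data set.

```r
# Toy data frame: illustrative values only, not real rows from the data set.
white_wine <- data.frame(
  fixed.acidity = c(6.2, 6.7, 7.2, 6.3, 6.8, 7.3, 6.4, 6.8, 7.5),
  quality_rate  = factor(rep(c("Excellent", "Good", "Bad"), each = 3),
                         levels = c("Excellent", "Good", "Bad"))
)

# One summary() per quality rate.
print(by(white_wine$fixed.acidity, white_wine$quality_rate, summary))

# Medians only, which are less sensitive to outliers than means.
medians <- tapply(white_wine$fixed.acidity, white_wine$quality_rate, median)
print(medians)
```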
The histogram is right skewed, so I will apply a log transformation to bring it closer to a normal distribution. The box plots show many outliers in every rate, so I will limit the y axis to see the quartiles more clearly.
With a very small bin width there are many gaps. The values range from 0.08 to 1.1. On average, the Bad rate has the highest volatile acidity, while the Good and Excellent rates have about the same amount. The statistical summary supports this:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2653 0.3200 0.7600
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.2400 0.2900 0.3103 0.3500 1.1000
From the summary, the maximum for the Excellent rate is 0.76 g/dm^3, the lowest maximum of the three rates.
In the histogram the distribution looks roughly normal; limiting the x axis will show the citric acid values better. The box plots have many outliers, so limiting the y axis helps as well:
From this plot, citric acid is roughly normally distributed, with the bulk of the data between 0.2 and 0.5 and a sudden peak at 0.49. The values of this variable have two decimal places (x.xx). The box plots show that, on average, citric acid is very similar across all rates; the statistical summary gives the medians.
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0100 0.2800 0.3100 0.3261 0.3600 0.7400
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.270 0.320 0.338 0.380 1.660
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2400 0.3200 0.3343 0.4100 1.0000
Yes, the medians are very close (0.31, 0.32, and 0.32 for the Excellent, Good, and Bad rates), and the maximum for the Excellent rate is 0.74 g/dm^3, the lowest of the three.
The distribution is right skewed, so transforming this variable to be closer to normal will make the distribution easier to understand. I will also limit the y axis of the box plots to see the quartiles; there are many outliers in the Excellent and Good rates. One wine, rated Good, has more than 45 g/dm^3 of residual sugar, which means it is considered sweet. The question here is: what is the typical amount of residual sugar in the wines rated Excellent?
After the transformation the distribution is bimodal, with most values between 1 and 20 g/dm^3 and 4 g/dm^3 near the middle of the x axis. It now has many peaks. The box plots show a difference in residual sugar between the rates: on average the Bad rate has the most sugar and the Excellent rate has the least, with a median of about 3.9 g/dm^3, which is consistent with our investigation.
The statistical summary shows this:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.800 3.875 5.262 7.400 19.250
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.625 7.054 11.025 23.500
The medians and third quartiles of residual sugar differ between the quality rates: 75% of the wines rated Excellent have no more than 7.4 g/dm^3 of residual sugar.
The distribution is right skewed, so it needs a transformation. There are many outliers and the quartiles are hard to see, so the box plots need rescaling. So what is the difference between the quality rates in the chlorides variable?
The bulk of the data lies between 0.03 and 0.06 g/dm^3 of sodium chloride. There are many values in the right tail, which appear to be outliers. On average, the amount of sodium chloride increases gradually from the Excellent rate to the Bad rate, with the Excellent rate having the least. I think this is one of the chemical properties that influence the quality of the wine. The statistical summary:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05144 0.05300 0.34600
The median chlorides value is 0.037 g/dm^3 for the Excellent rate, 0.043 g/dm^3 for the Good rate, and 0.047 g/dm^3 for the Bad rate. I think the taste of sodium chloride becomes evident at high amounts, which lowers the quality.
The distribution is right skewed, so I will transform it toward a normal distribution using the square root function. There seems to be no difference in this variable between the three rates. There is an outlier at 289, a very large value compared to the others, which the box plots place in the Bad rate. The question is: what is the average amount of free sulfur dioxide in the wines? I will take a closer look:
Free sulfur dioxide prevents microbial growth and the oxidation of wine. The bulk of the data lies between 20 and 60 mg/dm^3, with many peaks. The box plots show that the amount is very close across all rates on average.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
On average, the wines contain 35.31 mg/dm^3 of free sulfur dioxide.
The distribution is roughly normal, with many outliers. The histogram shows an outlier above 400, which the box plots place in the Bad rate. I will use a small bin width and take a closer look at the quartiles. But first, in what range does the bulk of the data lie?
The bulk of the data lies between 80 and 200 mg/dm^3. There is a clear difference in total sulfur dioxide between the rates on average, with the largest values in the Bad rate. This makes sense, since this property affects the taste and smell of the wine.
The distribution is roughly normal. The box plots show that the bulk of the data lies between 0.99 and 1. Most of the outliers are in the Excellent rate, but the largest outliers are in the Good rate, so rescaling the histogram and box plots is useful.
There is a clear difference in density between the rates: on average the Bad rate has the highest density and the Excellent rate has the lowest. On the other hand, the bulk of the data lies between 0.99 and 1, so the differences are only on the order of 0.00x. So what brings the density of a wine close to 1 g/cm^3, the density of water? I will investigate that later.
The pH value investigation
The pH is roughly normally distributed, and the bulk of the data lies between 2.9 and 3.4. The box plots show many outliers, none above 4. I will take the statistical summary for each rate:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.840 3.100 3.200 3.215 3.320 3.820
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.080 3.180 3.189 3.280 3.810
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.79 3.08 3.16 3.17 3.24 3.79
So there is no clear difference in pH between the rates; on average they have very close pH values.
But which pH values are most common in the wines?
##
## 2.72 2.74 2.77 2.79 2.8 2.82 2.83 2.84 2.85 2.86 2.87 2.88 2.89 2.9 2.91
## 1 1 1 3 3 1 4 1 9 9 9 11 17 31 15
## 2.92 2.93 2.94 2.95 2.96 2.97 2.98 2.99 3 3.01 3.02 3.03 3.04 3.05 3.06
## 18 38 35 26 63 32 41 68 74 49 68 78 97 89 115
## 3.07 3.08 3.09 3.1 3.11 3.12 3.13 3.14 3.15 3.16 3.17 3.18 3.19 3.2 3.21
## 79 136 92 135 126 134 117 172 136 164 124 138 145 137 95
## 3.22 3.23 3.24 3.25 3.26 3.27 3.28 3.29 3.3 3.31 3.32 3.33 3.34 3.35 3.36
## 146 116 132 114 96 88 87 82 93 79 86 49 79 48 83
## 3.37 3.38 3.39 3.4 3.41 3.42 3.43 3.44 3.45 3.46 3.47 3.48 3.49 3.5 3.51
## 49 58 40 39 30 48 20 33 17 28 21 21 23 15 14
## 3.52 3.53 3.54 3.55 3.56 3.57 3.58 3.59 3.6 3.61 3.62 3.63 3.64 3.65 3.66
## 17 13 14 9 8 5 5 6 7 3 1 6 2 4 5
## 3.67 3.68 3.69 3.7 3.72 3.74 3.75 3.76 3.77 3.79 3.8 3.81 3.82
## 1 2 2 1 3 2 2 2 2 1 2 1 1
The most common pH values are between 3.1 and 3.25; the single most frequent value is 3.14, with 172 wines.
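The most frequent value can also be read off programmatically with table() and which.max(). The sketch below uses only an excerpt of the frequency table above (pH 3.12 to 3.16), so it illustrates the technique rather than re-running the full analysis.

```r
# Excerpt of the pH frequency table above (values 3.12 to 3.16 only).
ph_counts <- c("3.12" = 134, "3.13" = 117, "3.14" = 172,
               "3.15" = 136, "3.16" = 164)

# Rebuild the raw pH values for this excerpt.
pH <- rep(as.numeric(names(ph_counts)), times = ph_counts)

# Frequency table and the value with the highest count.
ph_table <- table(pH)
most_common <- names(ph_table)[which.max(ph_table)]
print(most_common)
```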
There are outliers in every rate, the middle 50% of each of the three rates lies between 0.4 and 0.6, and the distribution is roughly normal.
For all quality rates, the average amount of potassium sulphate is very similar. But what is the average amount of sulphates in the wines?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The median amount of sulphates in the wines is 0.47 g/dm^3 (the mean is 0.49 g/dm^3).
The bulk of the data is in the range 8.5% to 13%. Rescaling the plots:
On average, the Excellent rate has the highest alcohol by volume (median 11.5%) and the Bad rate has the lowest (median 9.6%). The middle 50% of the Excellent rate does not overlap the middle 50% of the Bad rate, so I think this property has a strong effect on the quality of the wine, as we will see soon.
The white wine quality data set has 4,898 wines, with 11 chemical properties and a quality variable (rated from 0, the worst, to 10, the best). All of these properties are numeric:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Observations:
1. Most of the wines are rated 5 or 6.
2. Ratings 9 and 3 have the fewest samples.
3. Residual sugar in most wines varies between 1 and 20 g/dm^3, and on average the high ratings (7, 8, 9) have the least.
4. Most wines have 0.03 to 0.06 g/dm^3 of sodium chloride, and on average the high ratings have the least.
5. The density of most wines varies between 0.99 and 1 g/cm^3, and on average the high ratings have the lowest density.
6. The pH of most wines varies between 2.9 and 3.4, with outliers no higher than 3.82.
7. Total sulfur dioxide has outliers above 300, located in the low ratings. At high concentrations this property affects the taste of the wine.
8. Citric acid varies between 0.2 and 0.5 g/dm^3 for most wines, with outliers above 1 g/dm^3.
9. Alcohol for most of the white wines is between 8.5% and 13% by volume, and on average the high ratings have the most, with a median of 11.5%.
10. In general, all of these properties have outliers.
The main features in the data set are quality and alcohol. We want to know which features influence the quality of the wine, and I think alcohol is one of them.
Residual sugar, density, total sulfur dioxide, and sodium chloride differ in their values between the quality rates, according to the box plots.
I created the quality_rate variable because the number of observations varies widely between ratings: rating 9 has just 5 observations and rating 3 has 20, while ratings 5 and 6 each have more than a thousand.
I transformed several features:
1. Volatile acidity: transformed from right skewed to approximately normal using the log10 transformation.
2. Residual sugar: transformed from right skewed to bimodal using the log10 transformation.
3. Chlorides: transformed from right skewed to approximately normal using the log10 transformation.
4. Free sulfur dioxide: transformed from right skewed to approximately normal using the square root transformation.
I transformed these features so that their distributions are normal, or at least closer to normal. This makes their shape easier to understand, and many common statistical techniques, such as linear regression, rely on normality assumptions.
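To illustrate the effect of such a transformation, here is a small sketch on simulated data. The log-normal sample and the skewness() helper are my own assumptions for the demonstration; they stand in for a right-skewed variable such as volatile acidity and are not the real measurements.

```r
set.seed(42)

# Simulated right-skewed sample (hypothetical stand-in for a skewed variable).
x <- rlnorm(5000, meanlog = log(0.26), sdlog = 0.35)

# Sample skewness: third central moment divided by variance^(3/2).
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

skew_raw <- skewness(x)          # clearly positive: long right tail
skew_log <- skewness(log10(x))   # close to zero: roughly symmetric

print(c(raw = skew_raw, log10 = skew_log))
```

A skewness near zero after the log10 step is what "closer to normal" means in practice here; the square root transformation can be checked the same way by replacing log10() with sqrt().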
In this section, I will investigate the features using scatter plots, correlations, and conditional means.
The scatter plot matrix shows the relationships between the variables and the correlation coefficient for each pair.
(alcohol - density): -0.78
(residual sugar - density): 0.839
(fixed acidity - pH): -0.426
(residual sugar - total sulfur dioxide): 0.401
(residual sugar - alcohol): -0.451
(chlorides - alcohol): -0.36
(total sulfur dioxide - density): 0.53
(total sulfur dioxide - alcohol): -0.449
(alcohol - quality): 0.436
(density - quality): -0.307
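Coefficients like these can be computed with cor(). The following sketch uses simulated alcohol and density values, so the numbers it prints are not the ones listed above; the slope and noise level are assumptions chosen only to produce a strong negative relationship.

```r
set.seed(1)
n <- 500

# Simulated data: density falls as alcohol rises (assumed slope and noise).
alcohol <- runif(n, 8, 14)
density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)
toy <- data.frame(alcohol, density)

# Pearson correlation for one pair of variables.
r_pearson <- cor(toy$alcohol, toy$density)

# Full correlation matrix, rounded for reading.
cor_matrix <- round(cor(toy), 3)
print(cor_matrix)
```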
These are the most important relationships in this data set, and I will investigate them:
According to the correlation coefficient, this relationship is strong. The question here is: which alcohol percentages go with a high density, and which go with a low density?
The bulk of the density values lies between 0.99 and 1. After many iterations of rescaling, and after reducing the overplotting with an alpha value, I produced this scatter plot:
The strong linear relationship between alcohol and density is very clear in the scatter plot; the vertical lines are due to the effect of other variables. On average, the wines with the least alcohol have the highest densities, above 0.995 g/cm^3, while the wines with the most alcohol have densities of no more than 0.992 g/cm^3.
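The plotting steps described here (limiting the axes, using transparency against overplotting, and adding a smoother) can be sketched in base R graphics. This is a swapped-in illustration on simulated data, not the plotting code of this report, and the numbers in the simulation are assumptions.

```r
set.seed(2)
n <- 2000

# Simulated data; alcohol is rounded to one decimal, which creates vertical stripes.
alcohol <- round(runif(n, 8, 14), 1)
density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)

pdf(NULL)  # null device so the sketch runs without a display
plot(alcohol, density,
     col  = adjustcolor("black", alpha.f = 0.2),  # transparency reduces overplotting
     pch  = 16,
     ylim = quantile(density, c(0.01, 0.99)),     # zoom in; points outside are clipped
     xlab = "alcohol (% by volume)",
     ylab = "density (g/cm^3)")

# Locally weighted smoother showing the average trend.
trend <- lowess(alcohol, density)
lines(trend, col = "red", lwd = 2)
invisible(dev.off())
```

Limiting the axis range through ylim only hides extreme points from view; it does not remove them from the smoother, which is computed on all of the data.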
The question is: which amounts of residual sugar go with a high density, and which go with a low density?
There is an outlier with more than 60 g/dm^3 of residual sugar.
The linear relationship is clear, but the plot improves after rescaling, removing outliers, and reducing the overplotting:
The points jump up and down, but the increase in density as residual sugar increases is clear. On average, wines with more than 15 g/dm^3 of residual sugar have a density above 0.996 g/cm^3, and wines with less than 3 g/dm^3 of residual sugar have a density below 0.992 g/cm^3.
I will look at the overplotted region, which lies roughly between 0.5 and 2.5 on the x axis:
I think the vertical lines at each x value are due to other variables. Returning to the histogram of residual sugar, we can see peaks in this same region.
The relationship between total sulfur dioxide and density is moderate rather than strong, so we can ask: which amounts of total sulfur dioxide go with a high density, and which go with a low density?
Rescaling, reducing the overplotting, and adding a smoother to the plot directly:
There is a positive linear relationship here, but there are also a lot of data points, and I reached a dead end. I will therefore create a new data set of conditional means to see how the mean density varies with total sulfur dioxide. The histogram of total sulfur dioxide shows that the bulk of the data lies between 80 and 200, so I will subset the data first:
The positive linear relationship is clear in the conditional means, although there are jumps at many points. For example, at 75 to 100 mg/dm^3 the mean density is no more than 0.993 g/cm^3, and at 175 to 200 mg/dm^3 it is no less than 0.994 g/cm^3.
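A conditional-means data set of this kind can be built by subsetting and then aggregating. The sketch below uses base R's subset() and aggregate() on a tiny made-up data frame; the report's own code may have used a different approach, and the values are illustrative only.

```r
# Toy data: illustrative values only, not real rows from the data set.
toy <- data.frame(
  total.sulfur.dioxide = c(60, 90, 90, 120, 120, 180, 180, 250),
  density = c(0.9910, 0.9920, 0.9930, 0.9935, 0.9945, 0.9950, 0.9960, 0.9990)
)

# Keep only the bulk of the data, between 80 and 200 mg/dm^3.
in_bulk <- subset(toy, total.sulfur.dioxide >= 80 & total.sulfur.dioxide <= 200)

# Mean density for each distinct total sulfur dioxide value.
density_by_so2 <- aggregate(density ~ total.sulfur.dioxide,
                            data = in_bulk, FUN = mean)
names(density_by_so2)[2] <- "density_mean"
print(density_by_so2)
```

Plotting density_mean against total.sulfur.dioxide then gives one point per x value instead of a cloud, which is what makes the trend easier to read.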
How much residual sugar do the wines have when the alcohol percent is high, and when it is low? Is there a linear relationship, as the correlation coefficient of -0.451 suggests?
Reducing the overplotting:
I will compute the conditional mean of residual sugar for every alcohol value, rounded to two decimal places, by creating a new data set:
The points jump up and down, but on average the wines with the most alcohol have the least residual sugar, no more than 6 g/dm^3, while the wines with the least alcohol have more than 9 g/dm^3 on average. So the relationship is negative.
How much total sulfur dioxide do the wines have when the alcohol percent is high, and when it is low? I will investigate this relationship directly with summary statistics to avoid another dead end.
According to the mean summary on this scatter plot, the wines with the least alcohol have more than 150 mg/dm^3 of total sulfur dioxide on average. The amount then decreases as alcohol increases, and the wines with the most alcohol have between 100 and 125 mg/dm^3 on average.
The question is: how much chloride do the wines have when the alcohol percent is high, and when it is low?
After many iterations of plotting this relationship:
In this plot I have put a vertical line at the midpoint of the alcohol data. In the first half, the chlorides amount in most of the wines is, on average, not less than 0.045 g/dm^3; in the second half, most of the wines are below 0.045 g/dm^3 on average. The relationship is not very strong, but I think it is worth investigating.
There is a relationship between pH and fixed acidity, and the histogram of pH shows that the bulk of the data lies between 2.9 and 3.4. So which range of pH goes with high fixed acidity, and which goes with low fixed acidity?
There is a negative linear relationship between fixed acidity and pH. I like this relationship: there are no long jumps up and down, and the smooth line follows the means. On average, wines with high fixed acidity have a pH below 3.2, and wines with low fixed acidity have a pH above 3.2.
Alcohol and density have moderate linear relationships with the quality variable. Plotting density plots for each quality rate with alcohol, density, chlorides, residual sugar, and total sulfur dioxide gives these observations:
The density of the wine varies with several variables. It decreases as the alcohol percent increases, and vice versa. It increases as residual sugar increases, and vice versa. It also increases as total sulfur dioxide increases.
The alcohol percent increases as total sulfur dioxide, residual sugar, and chlorides decrease. So alcohol has negative linear relationships with total sulfur dioxide, residual sugar, and chlorides.
The quality of the wine is affected by alcohol percent, residual sugar, total sulfur dioxide, and chlorides.
I also observed a negative relationship between pH and fixed acidity: as fixed acidity increases, pH decreases.
The strongest relationships are between density and alcohol, and between density and residual sugar.
The question is: according to the previous investigations, for wines with a high percent of alcohol, is the amount of residual sugar responsible for a Bad or Good quality rate?
This plot led me to a dead end: it only shows that the wines with the highest alcohol are rated Good or Excellent. I think conditional medians are useful here.
This plot tells us that at a low alcohol percent it is better for residual sugar to be high, since a low amount goes with a Bad quality rate. As alcohol increases, residual sugar should decrease but stay at a moderate level. For example, at 10% alcohol the Excellent rate has about 6 g/dm^3, the Good rate has about 7 g/dm^3, and the Bad rate has the lowest value. On the other hand, between 11.5% and 13% alcohol the Excellent rate has the highest residual sugar, but no more than 4 g/dm^3.
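Conditional medians by two grouping variables can be sketched with aggregate() and a two-term formula. The data frame below is made up for illustration, so its medians are not the ones read from the plot, and this is not necessarily how the report computed them.

```r
# Toy data: illustrative values only, not real rows from the data set.
toy <- data.frame(
  alcohol = c(9, 9, 9, 9, 12, 12, 12, 12),
  quality_rate = factor(c("Bad", "Bad", "Excellent", "Excellent",
                          "Bad", "Bad", "Excellent", "Excellent"),
                        levels = c("Excellent", "Good", "Bad")),
  residual.sugar = c(2, 4, 8, 10, 1, 3, 3, 5)
)

# Median residual sugar for every (alcohol, quality_rate) combination present.
med <- aggregate(residual.sugar ~ alcohol + quality_rate,
                 data = toy, FUN = median)
print(med)
```

Each row of med is one point of a conditional-median line; drawing one line per quality rate gives a plot like the one described above.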
This plot led me to a dead end too, so it is better to compute conditional medians of total sulfur dioxide for each alcohol value:
This plot tells us that, on average, when the alcohol percent is low the Excellent rate has the lowest total sulfur dioxide at some points, and when the alcohol percent is high the Excellent rate has the highest total sulfur dioxide at some points. But there are a lot of peaks, which leads me to a deeper investigation with four variables:
In this section, I will investigate alcohol, quality_rate, residual sugar, and total sulfur dioxide in the same plot:
I will create a new variable called alcohol.percent, a factor with three levels of alcohol, (8-10), (10-12), and (12-14), and then rename the levels to low, moderate, and high.
## [1] "low" "moderate" "high"
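A factor like this can be sketched with cut() followed by renaming the levels. The alcohol values below are a handful of illustrative numbers within the observed range; the breaks follow the three intervals named above, and everything else is my assumption.

```r
# A few illustrative alcohol values, including the observed minimum and maximum.
alcohol <- c(8.0, 9.5, 10.4, 11.4, 12.4, 14.2)

# Intervals [8,10], (10,12], (12,14]; include.lowest keeps the value 8.0.
alcohol.percent <- cut(alcohol, breaks = c(8, 10, 12, 14), include.lowest = TRUE)

# Rename the interval labels.
levels(alcohol.percent) <- c("low", "moderate", "high")
print(levels(alcohol.percent))

# Values above the last break (here 14.2) become NA with these breaks.
print(data.frame(alcohol, alcohol.percent))
```

Since the maximum alcohol in the data is 14.2, a last break of 14 would leave the strongest wines unclassified; widening the last break (for example to 15) is one way to keep them in the high level.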
At a high level of alcohol there are few data points of Bad quality, and at a low level of alcohol there are few data points of Excellent quality. Good quality appears at all levels of alcohol. Although this plot tells us a lot, the relationship between residual sugar and total sulfur dioxide is still not clear, especially at 10-12% alcohol. This is another dead end, so conditional means are useful here.
There is a lot of information here. The decrease in total sulfur dioxide between the levels of alcohol is clear. I think a wine with a high amount of total sulfur dioxide and a high percent of alcohol would taste very bad, so the amount of total sulfur dioxide is reduced as the alcohol increases.
We can also see that at the 10-12% level of alcohol, the Excellent quality rate has no more than 15 g/dm^3 of residual sugar.
The grand median, drawn as a dashed blue line, confirms that at the first level most of the wines are from the Good and Bad quality rates, at the second level the wines are from all quality rates, and at the third level most of the wines are from the Excellent and Good rates.
From the previous investigation, the relationships between alcohol, chlorides, residual sugar, total sulfur dioxide, and quality appear roughly linear, so a linear model is a reasonable way to predict the quality of a wine:
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = white_wine)
## m2: lm(formula = I(quality) ~ I(alcohol) + residual.sugar, data = white_wine)
## m3: lm(formula = I(quality) ~ I(alcohol) + residual.sugar + total.sulfur.dioxide,
## data = white_wine)
## m4: lm(formula = I(quality) ~ I(alcohol) + residual.sugar + total.sulfur.dioxide +
## chlorides, data = white_wine)
##
## ================================================================================
##                             m1            m2            m3            m4
## --------------------------------------------------------------------------------
## (Intercept)                 2.582***      2.021***      2.048***      2.283***
##                            (0.098)       (0.117)       (0.139)       (0.153)
## I(alcohol)                  0.313***      0.354***      0.352***      0.339***
##                            (0.009)       (0.010)       (0.011)       (0.012)
## residual.sugar                            0.022***      0.022***      0.021***
##                                          (0.002)       (0.003)       (0.003)
## total.sulfur.dioxide                                   -0.000        -0.000
##                                                        (0.000)       (0.000)
## chlorides                                                            -2.058***
##                                                                      (0.558)
## --------------------------------------------------------------------------------
## R-squared                   0.190         0.202         0.202         0.204
## adj. R-squared              0.190         0.202         0.201         0.204
## sigma                       0.797         0.791         0.791         0.790
## F                        1146.395       619.354       412.870       313.857
## p                           0.000         0.000         0.000         0.000
## Log-likelihood          -5839.391     -5802.158     -5802.097     -5795.287
## Deviance                 3112.257      3065.298      3065.221      3056.709
## AIC                     11684.782     11612.317     11614.193     11602.574
## BIC                     11704.272     11638.303     11646.676     11641.553
## N                        4898          4898          4898          4898
## ================================================================================
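Nested models like m1 to m4 can be built by fitting a base model with lm() and adding one predictor at a time with update(). The sketch below runs on simulated data, so its coefficients and fit statistics are not those in the table above; the simulation settings are assumptions.

```r
set.seed(3)
n <- 1000

# Simulated predictors and response (assumed coefficients, for illustration only).
alcohol <- runif(n, 8, 14)
residual.sugar <- pmax(0.6, 20 - 1.5 * alcohol + rnorm(n, sd = 4))
quality <- 2 + 0.35 * alcohol + 0.02 * residual.sugar + rnorm(n, sd = 0.8)
toy <- data.frame(quality, alcohol, residual.sugar)

# Base model, then the same model with one more predictor.
m1 <- lm(quality ~ alcohol, data = toy)
m2 <- update(m1, . ~ . + residual.sugar)

# Compare fit: R-squared cannot decrease when a predictor is added.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(round(r2, 3))

# AIC penalizes extra parameters, so it is a fairer comparison.
aic <- AIC(m1, m2)
print(aic)
```

Because R-squared never decreases as predictors are added, the adjusted R-squared, AIC, and BIC rows are the more informative ones when comparing nested models.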
The relationship between residual sugar and alcohol for each quality rate tells us that, for a wine to be rated Excellent, it should have high residual sugar with a low percent of alcohol, moderate residual sugar with a moderate percent of alcohol, or low residual sugar with a high percent of alcohol.
The relationship between total sulfur dioxide and alcohol for each quality rate was not very clear to me, I think because of the effect of residual sugar. So I investigated these four variables together: I created a new factor variable with the levels of alcohol and faceted the plots by it, and the findings surprised me.
So I think the features that strengthen each other when investigating the quality rate are residual sugar, total sulfur dioxide, and alcohol.
One interesting interaction is that when the alcohol percent is low, wines rated Excellent tend to have the lowest residual sugar and total sulfur dioxide; and as residual sugar increases, total sulfur dioxide increases as well, up to the point where Excellent wines have both the highest residual sugar and the highest total sulfur dioxide.
I built linear regression models. Their strength is that they use the variables that, according to our investigation, most affect the quality of the wines and have linear relationships with each other. Their weakness is a low R-squared, about 0.2 for all four models. R-squared is a statistical measure of how close the data are to the fitted regression line; a value close to 1 means the model fits the data very well.
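To make the meaning of R-squared concrete, it can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on simulated data (the data and names are assumptions) and compares the result with the value reported by summary().

```r
set.seed(4)

# Simulated predictor and response (illustration only).
x <- runif(200, 8, 14)
y <- 2.5 + 0.3 * x + rnorm(200, sd = 0.8)

fit <- lm(y ~ x)

# R-squared = 1 - SS_residual / SS_total.
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
```

An R-squared of about 0.2 therefore means that the fitted line removes only about a fifth of the variation in quality around its mean.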
The plot describes the percent of alcohol by volume. Most wines have between 8.7% and 13% alcohol, and more than 500 of the wines have around 9.5% alcohol (the two peaks in the histogram). The most interesting observation is that the wines rated Excellent have the highest percent of alcohol, the wines rated Good have a moderate percent of alcohol, and the wines rated Bad have the lowest percent of alcohol.
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
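Summaries in this layout are what `by()` prints when it applies `summary()` to each level of a factor. A minimal sketch follows. The rows in `toy` are made up, and the quality thresholds (5 or less is "Bad", 6 is "Good", 7 or more is "Excellent") are an assumption, since the grouping rule is not stated in this section.

```r
# ASSUMPTION: made-up rows for illustration only.
toy <- data.frame(
  alcohol = c(9.2, 9.6, 10.4, 10.5, 11.4, 11.5, 12.4, 9.6),
  quality = c(4, 5, 5, 6, 6, 7, 8, 6)
)

# ASSUMPTION: thresholds are a guess; bins are (-Inf, 5], (5, 6], (6, Inf).
toy$quality_rate <- cut(
  toy$quality,
  breaks = c(-Inf, 5, 6, Inf),
  labels = c("Bad", "Good", "Excellent")
)

# Reorder the levels so the groups print as Excellent, Good, Bad
toy$quality_rate <- factor(toy$quality_rate,
                           levels = c("Excellent", "Good", "Bad"))

# One summary() per quality rate
alcohol_by_rate <- by(toy$alcohol, toy$quality_rate, summary)
print(alcohol_by_rate)
```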
The alcohol and density features have a moderate linear relationship with the quality variable. Plotting the density plots for each quality rate with alcohol, density, chlorides, residual sugar, and total sulfur dioxide gives us several observations.
These statistical summaries (alcohol, chlorides, residual sugar, and total sulfur dioxide, in that order) support the findings of this plot:
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.04000 0.04700 0.05144 0.05300 0.34600
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.800 3.875 5.262 7.400 19.250
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.625 7.054 11.025 23.500
## white_wine$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.2 146.0 229.0
## --------------------------------------------------------
## white_wine$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.2 132.0 137.0 164.0 294.0
## --------------------------------------------------------
## white_wine$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 117.0 149.0 148.6 182.0 440.0
This plot describes the alcohol percent, the total sulfur dioxide, and the residual sugar for each quality rate. Most of the wines with a low percent of alcohol are rated good or bad, as we can see from the grand mean (dashed line), and most of the wines with a high percent of alcohol are rated good or excellent (also from the grand mean).
The linear relationship between residual sugar and total sulfur dioxide is positive, so as the residual sugar increases, the total sulfur dioxide increases too in all of the quality rates.
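The direction of a linear relationship like this can be checked with `cor()`. The sketch below uses only the first ten rows printed in the `str()` output at the top of this report, so it illustrates the method rather than the result for the full data set. The names `first_ten`, `cor_sugar_so2`, and `cor_sugar_alcohol` are chosen here for illustration.

```r
# First ten rows, copied from the str() output at the top of this report.
first_ten <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)
)

# Pearson correlation: the sign gives the direction of the linear relationship
cor_sugar_so2 <- cor(first_ten$residual.sugar, first_ten$total.sulfur.dioxide)
cor_sugar_alcohol <- cor(first_ten$residual.sugar, first_ten$alcohol)

print(round(c(sugar_vs_so2 = cor_sugar_so2,
              sugar_vs_alcohol = cor_sugar_alcohol), 2))
```

On these ten rows the coefficient for residual sugar against total sulfur dioxide is positive, while the one for residual sugar against alcohol is negative.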
Wines with the excellent quality rate don't have more than 16 g/dm^3 of residual sugar, except those with a low percent of alcohol. Excellent wines with a moderate level of alcohol have the lowest amount of total sulfur dioxide; it increases with the residual sugar until, at a moderate amount of residual sugar, they have more total sulfur dioxide than the bad wines.
These statistical summaries (total sulfur dioxide first, then residual sugar, each split by alcohol level) support what we find from this plot.
## alcohol.low$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 84.0 135.0 163.0 161.3 189.0 229.0
## --------------------------------------------------------
## alcohol.low$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 125.0 157.0 156.8 186.0 272.0
## --------------------------------------------------------
## alcohol.low$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 133.0 163.0 160.7 189.0 344.0
## [1] "-----------------------------------------------------"
## alcohol.moderate$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.0 101.5 122.0 122.0 142.0 209.0
## --------------------------------------------------------
## alcohol.moderate$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 102.0 124.0 128.6 154.0 253.0
## --------------------------------------------------------
## alcohol.moderate$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.0 95.0 126.0 126.6 156.0 440.0
## [1] "-----------------------------------------------------"
## alcohol.high$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 97.0 113.0 114.6 128.0 197.0
## --------------------------------------------------------
## alcohol.high$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28.0 92.0 111.0 112.6 128.0 294.0
## --------------------------------------------------------
## alcohol.high$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 71.00 89.50 85.81 107.50 156.00
## [1] "-----------------------------------------------------"
## alcohol.low$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.100 8.113 13.350 11.064 14.400 19.250
## --------------------------------------------------------
## alcohol.low$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 5.275 9.450 9.264 13.200 31.600
## --------------------------------------------------------
## alcohol.low$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 3.500 8.100 8.406 12.300 23.500
## [1] "-----------------------------------------------------"
## alcohol.moderate$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.600 2.700 4.383 6.650 15.100
## --------------------------------------------------------
## alcohol.moderate$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.500 3.350 5.008 7.500 65.800
## --------------------------------------------------------
## alcohol.moderate$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.300 2.000 4.377 6.800 18.950
## [1] "-----------------------------------------------------"
## alcohol.high$quality_rate: Excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.900 3.500 4.046 5.200 15.500
## --------------------------------------------------------
## alcohol.high$quality_rate: Good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.600 2.450 3.723 5.000 20.300
## --------------------------------------------------------
## alcohol.high$quality_rate: Bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.400 2.350 4.050 4.775 22.600
The white wine data set contains 4898 white wines with the amounts of their chemical properties and a quality rating given by experts. In this data set the big question was:
Which chemical properties influence the quality of white wines?
I started by analyzing each property alone to see its distribution and, side by side, its relationship with the quality of the wines using box plots. This analysis produced a lot of findings, but the most interesting one concerns the alcohol feature, which is very clearly an important factor in the quality of the wine. Then I did a bivariate analysis between the variables that have a linear relationship with each other, using the scatter plot matrix and the correlation coefficients to find these variables. Two of these investigations were the relationship between density and alcohol and the relationship between density and residual sugar, which interested me, maybe because of the strong linear relationship between them. Finally I did a multivariate analysis with the quality rate. There were many difficulties at some points, but on the other hand there were successes in getting past the dead ends.
I ran into difficulties in the analysis. The first was that I couldn't understand the chemical properties, but by reading about this data set I became familiar with them. The units in particular were clear in the text, and that helped me understand the wines and how the chemicals are added to them. Another difficulty was the overplotting and the dead ends in many cases: some regions hold a lot of data points with no relationship that I could see. After using an alpha value of 0.5, taking summaries on the scatter plots, and adding a smoother, the analysis flowed and the relationships became clearer.
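As a small sketch of the "summaries on the scatter plots" idea, conditional means and medians can be computed with `aggregate()`, and a half-transparent point colour can be built with `adjustcolor()`. The rows are the first ten from the `str()` output at the top of this report. The choice of residual sugar against alcohol is only an example, and the names `first_ten`, `cond_mean`, `cond_median`, and `point_col` are picked for illustration.

```r
# First ten rows, copied from the str() output at the top of this report.
first_ten <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Conditional mean and median of residual sugar at each distinct alcohol value
cond_mean <- aggregate(residual.sugar ~ alcohol, data = first_ten, FUN = mean)
cond_median <- aggregate(residual.sugar ~ alcohol, data = first_ten, FUN = median)
print(cond_mean)

# A 50% transparent black, usable as the col argument of base plot()
# so that dense regions of overlapping points appear darker
point_col <- adjustcolor("black", alpha.f = 0.5)
```

In ggplot2 the same ideas correspond to `geom_point(alpha = 0.5)` for the transparency and `geom_smooth()` for the smoother.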
Although I found difficulties in my analysis, I also found successes in it: the relationship between quality and alcohol; solving the overplotting and the dead ends with the alpha value and the conditional means and medians; the linear relationship of alcohol and residual sugar with density; and the relationship between the variables and the quality rates. All of these were successes for me, but the biggest one came when I put the four variables in one plot (residual sugar, total sulfur dioxide, alcohol, and quality). I found a large amount of information there and supported it with statistical summaries so that readers can find any information they want.
I think we still can't predict the quality of the wine with the linear regression model, because the R-squared is very low. We need another model to do the prediction; one of the machine learning algorithms would be suitable for that.